Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[2/3 sled-agent] move some common types into their own crate #6122

Conversation

sunshowers
Copy link
Contributor

@sunshowers sunshowers commented Jul 19, 2024

These are core API types used by the bootstrap agent. This is almost entirely
pure code movement with no functional changes.

These types will be exposed via the upcoming bootstrap-agent-api crate, and I
don't want to clutter the API crate too much. In the future we could also
batch-replace these with the new progenitor stuff if desired.

Depends on #6089.

Created using spr 1.3.6-beta.1

[skip ci]
Created using spr 1.3.6-beta.1
Created using spr 1.3.6-beta.1
@sunshowers sunshowers changed the title [sled-agent] move some common types into their own crate [2/3 sled-agent] move some common types into their own crate Jul 19, 2024
Created using spr 1.3.6-beta.1
@@ -0,0 +1,606 @@
// This Source Code Form is subject to the terms of the Mozilla Public
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Full-admission, I'm sorta skimming over this rather than reading line-by-line because it looks mostly copied

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's fine since it's almost entirely code movement, there are no functional changes so it's more just making sure the code organization looks fine. Thanks!

@@ -1,14 +1,14 @@
use crate::helpers::generate_name;
use anyhow::{anyhow, Context as _, Result};
use chrono::Utc;
use omicron_sled_agent::rack_setup::config::SetupServiceConfig;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call to stop using this alias

@sunshowers sunshowers changed the base branch from sunshowers/spr/main.sled-agent-move-some-common-types-into-their-own-crate to main July 19, 2024 20:49
Created using spr 1.3.6-beta.1
@sunshowers sunshowers merged commit 204ea7d into main Jul 19, 2024
25 checks passed
@sunshowers sunshowers deleted the sunshowers/spr/sled-agent-move-some-common-types-into-their-own-crate branch July 19, 2024 23:11
sunshowers added a commit that referenced this pull request Jul 25, 2024
…cy graph (#6138)

Reorganize some of our data structures and simplify the dependency
graph. In particular, there's now just one copy of `OmicronZonesConfig`
and related types.

This is _mostly_ code reorganization, but there are a couple of
interesting bits in here. Scroll down to the Questions section for more.

## Motivation

### 1. The necessity of sled-agent-types

The primary driver behind this change was turning sled-agent's API into
a trait -- breaking the sled-agent/nexus dependency cycle was one of the
main motivations for that work. To recall, we currently have this
logical cycle in the graph
([reference](https://rfd.shared.oxide.computer/rfd/0479#circular_deps)):

![Screenshot from 2024-07-21
21-03-02](https://github.com/user-attachments/assets/9227d036-ab37-4b1e-b4ae-e390a84e829b)

With the work, we're attempting to move somewhere closer to this graph:

![Screenshot from 2024-07-21
21-07-30](https://github.com/user-attachments/assets/4d93e98e-cfdc-46dc-9244-7fca8b72d5d1)

But to make that happen, some common types need to be moved into a
`sled-agent-types` crate, so that the upcoming `sled-agent-api` crate
can depend on them.

Why not put them in `sled-agent-api` itself? Because:

* In `sled-agent-client`, we often end up normalizing to a shared type
-- so we either want to implement `From` conversions between these types
and `sled-agent-client` types, or to `replace` them outright. So one of
them has to depend on the other.
* If there's even one type that we want to `replace`, the dependency
direction must be `sled-agent-client` -> `sled-agent-api`.
* But the client depending on the API trait is exactly the kind of
circular dependency we're trying to avoid.
* Another reason for trying to avoid a circular dependency here is that
it has the potential to leave types "floating in the ether", just
circulating via the schema with no definition in Rust. (This actually is
a risk today with the way dependencies are set up!)

So the only feasible alternative is to put these types in a shared
crate. This crate is `sled-agent-types`, which we recently introduced in
#6122.

### 2. sled-agent-types <-> nexus-types

For types shared between Nexus and sled-agent, where should we put them?
The options are:

* **omicron-common**: this is where we've been putting these types in
the past, and is not a terrible place to put them. However, if those
types pull in extra dependencies it becomes a bit more problematic,
because omicron-common is depended on by several workspaces outside
omicron.
* A secondary option would be putting these types behind an optional
feature. But that would cause a lot of unnecessary rebuilds of
omicron-common and all of its dependents (essentially splitting the
build unit universe into two), while the option we eventually chose does
not.
* **sled-agent-types** or **nexus-types**: we could put them in one or
the other, but we'd have to introduce a dependency between them at some
point, and they're both logically peers -- neither direction seems like
the right one.
* **nexus-config**: This depends on `tokio-postgres` (and was actually
split out of omicron-common for that reason). It wouldn't be great for
nexus-client or sled-agent-client to transitively pull in
tokio-postgres.
* **a new crate**: This is the option we've chosen. It does add the
complexity of another crate, but resolves all of the previous points.
We're calling this **nexus-sled-agent-shared** to signal that it sits on
top of omicron-common and below nexus-types and sled-agent-types.

Something to consider is whether we want to have two clients in omicron,
one for internal consumers with more dependencies and one for external
consumers with fewer ones. I'm not saying we should do this; it's just
something to keep at the back of one's head.

## The changes

There are a number of closely-related changes here -- most of them had
to be atomic with this one. A description of the changes and the
rationale for them:

### 1. Zone config types

These are `OmicronZonesConfig`, `OmicronZoneConfig`, `OmicronZoneType`,
and `OmicronZoneDataset`. In this PR, I've moved them from `sled-agent`
to `nexus-sled-agent-shared`.

* These types are an important part of the sled-agent API, and
`nexus-types` previously used the progenitor-generated copy of them.
* However that has had some issues for us, e.g.
#4988.
* Personally speaking, having two copies of such a complex type floating
around has been suboptimal for me while I've been writing blueprint
code. For example, previously, ctrl-clicking on `OmicronZonesConfig` in
vscode always pointed me at the progenitor macro which wasn't helpful.
Now, it points at the definition in `nexus-sled-agent-shared`.
* There's an extra transitive dependency this introduces, which is why
they're in `nexus-sled-agent-shared`.

There's one incident that I found along the way that I believe is a
**bug fix**. We have rules for DNS server IP addresses:

* External DNS servers can be IPv4 or IPv6
* Internal DNS servers are always v6

But the Progenitor-generated copy of `OmicronZonesConfig` uses strings
to store address/port pairs. As a result, in some cases we managed to
store an IPv4 address for an internal DNS server. With this PR, this
case is now detected. (I think it would have been ultimately rejected at
some point, but it's possible for the bad DNS server to end up being
persisted.)

### 2. `ZoneKind`

There were previously two dataless enums describing the kind of Omicron
zone floating around.

1. [`ZoneKind` in
`sled-agent-client`](https://github.com/oxidecomputer/omicron/blob/26959bfc1a073c91a920bd521da13a980d137e3f/clients/sled-agent-client/src/lib.rs#L87),
used by Nexus: exactly `OmicronZoneType` without any associated data.
2. [`ZoneType` in
`sled-agent`](https://github.com/oxidecomputer/omicron/blob/26959bfc1a073c91a920bd521da13a980d137e3f/sled-agent/src/params.rs#L265),
used by sled-agent -- identifying the substring used within the service
(notably with `BoundaryNtp` and `InternalNtp` combined into one).

While looking around, I also noticed that there were actually four
different string representations in use for Omicron zones:

1. the zone prefix
2. the service prefix
3. a prefix for the `Name` instance
4. a string used in reports and error messages.

Here, 1-3 are pretty similar to each other (they combine `InternalNtp`
and `BoundaryNtp` down to one, and are generated by sled-agent's
`ZoneType`).
  
There is also a `ZoneType` in nexus-db-model that corresponds to
`ZoneKind`.

Based on this, I decided to combine it all down into one `ZoneKind`
enum, with 4 functions returning 4 string representations: one each for
1-4.
* There's no `Display` impl -- users must choose the representation they
want, which I think is correct.
* I've tried to carefully verify that there are **no behavior changes**
introduced here.
* An alternative would be to have a dedicated representation for 1-3
along with one for 4, but I tried it out and it seemed too unwieldy in
practice. (For example, a few places want to use the zone prefix, but
then report errors using the report string. Having it be one enum with
four representations in string-land makes this easier. I do hope we
won't ever need a fifth, though.)

### 3. Physical disk config

`OmicronPhysicalDiskConfig` and `OmicronPhysicalDisksConfig` are shared
types used by both Nexus and sled-agent. I moved them from
`sled-storage` to `omicron-common`, next to `DiskIdentity` (which was
already in `omicron-common` and which these types use).

* Previously, `nexus-types` used the `sled-agent-client`-generated copy
of these types -- but as discussed in the motivation section, that is no
longer an option.
* Why not `nexus-sled-agent-shared`? Well, `sled-storage` also needs
these types, and I wanted to avoid sled-storage depending on the
additional dependencies in `nexus-sled-agent-shared`.
* I also used `replace` to patch them in so that there's just one copy
of these types throughout sled-agent and Nexus.

### 4. Recovery silo types

These are two types, `RecoverySiloConfig` and `Certificate`, that are
primarily related to Nexus and that `sled-agent` was using via
`nexus-client`. Users of these types needed to be moved into
sled-agent-types, which (as described in the motivation section) cannot
depend on nexus-client. So I've moved them into shared crates.

* `Certificate` is a simple type that can go into `omicron-common`'s
`api::internal::nexus`.
* `RecoverySiloConfig` depends on `omicron-passwords`, so it goes in
`nexus-sled-agent-shared`.

We also patch these types in `nexus-client`. One **behavior change** as
a result is that password hash strength is now validated on **both the
client and the server.** I think this is okay, but it's worth mentioning
explicitly.

### 5. `UserId`

So this is a really interesting one. It needs to be moved into a shared
location because `RecoverySiloConfig` depends on it. But during this
process I noticed that we weren't doing any validation on it in clients
like wicketd. So this counts as a **behavior change**: `UserId` is now
validated on **both the client and the server**.

Note that `UserId` is exposed by the Nexus external API. But this change
itself has no impact on the external API (this can be seen by the fact
that `nexus.json` doesn't have any changes in it)

(I realized that the name should probably be something closer to
`LocalUserId`, but that would be a breaking change to the external API
so I've deferred it. Happy to change the name if it makes sense,
though.)

### 6. Inventory types

These are `Inventory`, `InventoryDisk` and `InventoryZpool`. I've moved
them to `nexus-sled-agent-shared` as a replacement for the previous
scheme where we were using client-generated versions of them.

### 7. `DiskVariant`, `SledRole`, other misc types

These are simple types depended upon by others that have been moved into
shared crates -- locations chosen mostly based on which code depends on
them.

## OpenAPI changes

There are some changes to our OpenAPI specifications.

* Most of the changes are to `description` instances now that they're no
longer going through Progenitor twice.
* There are some validation changes, as documented above.

## Questions

- [x] What should the name of the new crate be?
  - Resolved in 2024-07-23 meeting: **nexus-sled-agent-shared**.
- [x] Verify that the behavior changes documented above are all
reasonable.

## Other notes

It would be nice to have tooling which asserts that e.g. none of the
`-types` crates ever depend on `-client` crates. Haven't written this
tooling though.

Previously I did similar things using
[guppy](https://github.com/guppy-rs/guppy), but that's only available as
a library and can take a while to build (the total number of build units
for `xtask` goes from 254 to 279, which is significant). Turning this
into a binary would solve this but requires some work.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants